An In-Depth Analysis of Gun Violence in America¶

Will M, Zichao L, Ethan B¶

Part 1: Introduction¶

Gun violence has become a significant problem in America today. News reports and social media constantly remind us that gun violence is a part of our lives, and this threat disrupts them. Schools are running shooting drills, products like bulletproof vests are becoming ever more common, and our politics are divided over what the right thing to do is.

In 2020, gun violence was the most common cause of death among people younger than 19. Between 1968 and 2011, an estimated 1.4 million Americans died from gun violence. The gun-related homicide rate in the United States is 25 times higher than in other developed countries. Given these statistics, it makes sense that the general public should be informed about this issue.

In this tutorial, we will do an in-depth analysis of the history, causes, and effects of gun violence. The data we will be using can be found <a href="https://github.com/jamesqo/gun-violence-data">here</a>. The ultimate goal is to understand the factors that contribute the most to gun violence.

Part 2: Data¶

We will start by importing the necessary packages.

In [180]:
import pandas as pd
import numpy as np

The first thing we need to do is to read in our data. This can be done with pandas, and here is the result:

In [181]:
data = pd.read_csv("stage3.csv")
data.head()
Out[181]:
incident_id date state city_or_county address n_killed n_injured incident_url source_url incident_url_fields_missing ... participant_age participant_age_group participant_gender participant_name participant_relationship participant_status participant_type sources state_house_district state_senate_district
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 http://www.post-gazette.com/local/south/2013/0... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female 0::Julian Sims NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://pittsburgh.cbslocal.com/2013/01/01/4-pe... NaN NaN
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 http://www.dailybulletin.com/article/zz/201301... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male 0::Bernard Gillis NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://losangeles.cbslocal.com/2013/01/01/man-... 62.0 35.0
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 http://chronicle.northcoastnow.com/2013/02/14/... False ... 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male 0::Damien Bell||1::Desmen Noble||2::Herman Sea... NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic... http://www.morningjournal.com/general-news/201... 56.0 13.0
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 http://www.dailydemocrat.com/20130106/aurora-s... False ... 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male 0::Stacie Philbrook||1::Christopher Ratliffe||... NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://denver.cbslocal.com/2013/01/06/officer-... 40.0 28.0
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 http://www.journalnow.com/news/local/article_d... False ... 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 0::Danielle Imani Jameison||1::Maurice Eugene ... 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://myfox8.com/2013/01/08/update-mother-sho... 62.0 27.0

5 rows × 29 columns

This table is rather big, so we will need to do some cleaning and tidying before we can start our analysis.

First, we won't need all the data in this table. According to the dataset, some of the columns are not required and thus may contain NaN values. We don't want these, as they will make our analysis more difficult than it needs to be. Out of the 29 columns, only 9 are required. That said, we don't want to remove all of the unrequired columns, as some contain valuable information we will need. The columns we will remove are those that are neither required nor necessary for this analysis.

The following columns will be removed:

  • source_url
  • congressional_district
  • location_description
  • notes
  • participant_name
  • sources
  • state_house_district
  • state_senate_district

Here is the result:

In [182]:
columns_to_remove = [
    "source_url",
    "congressional_district",
    "location_description",
    "notes",
    "participant_name",
    "sources",
    "state_house_district",
    "state_senate_district",
]
data = data.drop(columns = columns_to_remove)
data.head()
Out[182]:
incident_id date state city_or_county address n_killed n_injured incident_url incident_url_fields_missing gun_stolen ... incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 False NaN ... Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 False NaN ... Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 False NaN ... Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

5 rows × 21 columns

Second, we need to remove columns that are well-formed but are either unnecessary or contain sensitive information, like an address. We want this analysis to remain as anonymous as possible, and we want to respect those who were affected by these incidents.

We will handle NaN values on a per-situation basis. Pandas helps here by offering functions like isnull(), which flags each value that is missing (NaN). With this, we can continue our analysis without much trouble.
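As a small illustration of how isnull() behaves, the sketch below builds a toy DataFrame and counts the missing entries in each column. The column names mirror two of the dataset's columns, but the values are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these rows are not from the dataset.
toy = pd.DataFrame({
    "n_killed": [0, 1, 4],
    "gun_type": ["0::Handgun", np.nan, np.nan],
})

# isnull() returns a boolean mask with True wherever a value is missing.
missing_mask = toy.isnull()

# Summing the mask column by column counts the NaNs in each column.
missing_per_column = missing_mask.sum()

# Keep only the rows where gun_type is present.
rows_with_gun_type = toy[~toy["gun_type"].isnull()]
```

Checking a single column this way, right before it is used, is one way to handle NaN values case by case instead of dropping rows up front.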

Here is the final result of removing those columns, and the data we will be using in the rest of the analysis:

In [183]:
labels = ["address", "incident_url", "incident_url_fields_missing"]
data = data.drop(columns = labels)
data.head()
Out[183]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 0 4 NaN NaN Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 1 3 NaN NaN Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1 3 0::Unknown||1::Unknown 0::Unknown||1::Unknown Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 4 0 NaN NaN Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 2 2 0::Unknown||1::Unknown 0::Handgun||1::Handgun Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

Now that our data has been cleaned up, it's time to explain what we are looking at. This dataset tracks every single recorded incident of gun violence in the United States between early 2013 and early 2018. It contains all the critical information we need to understand each incident that occurred, such as where and when it happened, who was involved, and what the outcome was. Below is a summary of each column and what it tells us about the incident.

  • date: when the incident occurred
  • state: what state the incident occurred in
  • city_or_county: what city or county the incident occurred in
  • n_killed: how many people were killed in the incident
  • n_injured: how many people were injured in the incident
  • gun_stolen: whether or not the gun/guns used were stolen
  • gun_type: what type of gun/guns were used
  • incident_characteristics: specific details about the incident
  • latitude: geographic latitude of the incident
  • longitude: geographic longitude of the incident
  • n_guns_involved: how many guns were involved in the incident
  • participant_age: a breakdown of each participant's age
  • participant_age_group: a breakdown of each participant's age group
  • participant_gender: a breakdown of each participant's gender
  • participant_relationship: a breakdown of each participant's relationship to other participants
  • participant_status: a breakdown of the outcome of each participant
  • participant_type: a breakdown of each participant's role in the incident
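The participant_* columns pack one value per participant into a single string, as the sample rows above show: entries are separated by || and each entry has the form index::value. A minimal sketch of unpacking such a string follows. The helper name parse_participant_field is hypothetical, and the sketch assumes every entry uses the :: separator.

```python
def parse_participant_field(field):
    """Map each participant index to its value, e.g. {0: "Male", 1: "Female"}."""
    parsed = {}
    for entry in field.split("||"):
        index, separator, value = entry.partition("::")
        # Skip entries that do not follow the assumed "index::value" shape.
        if separator and index.isdigit():
            parsed[int(index)] = value
    return parsed


# A string copied from the first sample row; note that participant 2 has no entry.
genders = parse_participant_field("0::Male||1::Male||3::Male||4::Female")
```

Keying the result by participant index, rather than relying on position, keeps values from different columns aligned even when one column skips a participant.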

Part 3: Analysis¶

Graphs¶

To begin our analysis, we want to get a good understanding of the data.

In [184]:
import matplotlib.pyplot as plt
import seaborn as sns

Distribution of Fatalities in Mass Shootings¶

In [216]:
# Count how many incidents had each number of fatalities
frequencies = data["n_killed"].value_counts().to_dict()
# Keep only incidents with at least 4 fatalities
for i in range(4):
    frequencies.pop(i, None)
In [223]:
plt.bar(frequencies.keys(), frequencies.values(), width = 0.7)
plt.xlim([0, 27])
plt.xlabel("Number of Fatalities")
plt.ylabel("Frequency")
plt.title("Distribution of Fatalities in Mass Shootings")
plt.show()

Frequency of Different Gun Types Used in Shootings¶

In [215]:
# Count how often each gun type is mentioned, skipping rows with no gun_type
gun_type_column = data["gun_type"].dropna()
gun_types = {
    name : int(gun_type_column.str.count(name).sum())
    for name in ["Handgun", "Rifle", "Shotgun"]
}
In [214]:
plt.bar(gun_types.keys(), gun_types.values())
plt.xlabel("Gun Type")
plt.ylabel("Frequency")
plt.title("Frequency of Different Gun Types Used in Shootings")
plt.show()

Male Versus Female Involvement in Gun Violence¶

In [205]:
male_vs_female = {
    "Child 0-11" : [0, 0],
    "Teen 12-17" : [0, 0],
    "Adult 18+" : [0, 0]
}

def parse_participants(field):
    # Turn "0::Male||1::Female" into {"0": "Male", "1": "Female"}
    parsed = {}
    for token in field.split("||"):
        participant, _, value = token.partition("::")
        parsed[participant] = value
    return parsed

for index, row in data.iterrows():
    if not pd.isnull(row["participant_gender"]) and not pd.isnull(row["participant_age_group"]):
        genders = parse_participants(row["participant_gender"])
        age_groups = parse_participants(row["participant_age_group"])
        # Pair values by participant index rather than by position, since
        # either column can be missing entries (e.g. "0::Male||1::Male||3::Male")
        for participant, gender in genders.items():
            age_group = age_groups.get(participant)
            if age_group in male_vs_female and gender in ("Male", "Female"):
                column = 0 if gender == "Male" else 1
                male_vs_female[age_group][column] += 1
In [213]:
labels = ["Adult", "Teen", "Child"]
male_data = [male_vs_female["Adult 18+"][0], male_vs_female["Teen 12-17"][0], male_vs_female["Child 0-11"][0]]
female_data = [male_vs_female["Adult 18+"][1], male_vs_female["Teen 12-17"][1], male_vs_female["Child 0-11"][1]]

x_axis = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
fig.set_figwidth(10)
fig.set_figheight(8)
rects1 = ax.bar(x_axis - width/2, male_data, width, label = "Male")
rects2 = ax.bar(x_axis + width/2, female_data, width, label = "Female")

ax.set_xlabel("Age Group")
ax.set_ylabel("Amount of Involvement")
ax.set_ylim([0, 275000])
ax.set_title("Male Versus Female Involvement in Gun Violence")
ax.set_xticks(x_axis, labels)
ax.legend()
ax.bar_label(rects1, padding = 3)
ax.bar_label(rects2, padding = 3)
plt.show()

Mean Age of Participants Between 15 and 75 Versus Lethality¶

Lethality is calculated using the following formula:

$ \text{Lethality} = 2 \times \text{Participants Killed} + 1.5 \times \text{Participants Injured} $
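As a quick worked example of the formula, the sketch below computes the score for one incident. The function name lethality is chosen here for illustration; the weights are the ones given in the formula.

```python
def lethality(n_killed, n_injured):
    """Weighted severity score: 2 per person killed, 1.5 per person injured."""
    return 2 * n_killed + 1.5 * n_injured


# Counts taken from the second sample row (1 killed, 3 injured).
score = lethality(1, 3)  # 2 * 1 + 1.5 * 3 = 6.5
```

Weighting deaths more heavily than injuries means an incident with 4 killed (score 8) ranks above one with 5 injured (score 7.5).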

In [208]:
def mean_age_of_participants(field):
    # Mean age of the participants aged 15 to 74, or None if there are none
    if pd.isnull(field):
        return None
    ages = []
    for token in field.split("||"):
        # Each token looks like "0::20"; keep only the age after the "::"
        _, _, value = token.partition("::")
        if value.isdigit() and 15 <= int(value) < 75:
            ages.append(int(value))
    if not ages:
        return None
    return sum(ages) / len(ages)

filtered = []
for index, row in data.iterrows():
    mean_age = mean_age_of_participants(row["participant_age"])
    lethality = float((2 * row["n_killed"]) + (1.5 * row["n_injured"]))
    # Skip incidents with no usable participant ages
    if mean_age is not None:
        filtered.append([mean_age, lethality])
In [207]:
x_data, y_data = [], []
for entry in filtered:
    if entry[0] < 75 and entry[1] < 100:
        x_data.append(entry[0])
        y_data.append(entry[1])
[slope, intercept] = np.polyfit(x_data, y_data, 1)
plt.figure(figsize = (10, 8))
plt.scatter(x_data, y_data, s = 30, edgecolor = "black")
plt.xlabel("Mean Age of Participants")
plt.ylabel("Measure of Lethality")
plt.title("Mean Age of Participants Between 15 and 75 Years Old In Shootings Versus Lethality")
plt.plot(np.asarray(x_data), slope * np.asarray(x_data) + intercept, color = "orange")
plt.show()

Maps¶

One good way to view this dataset is by generating a map. To do this, we first load a GeoJSON file containing the relevant information for each state. Then we count the incidents in each state and merge the counts into the state data, so that we can plot both together.
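The count-and-merge step can be sketched with plain pandas on toy data. Here state_shapes is a stand-in for the GeoJSON table (assumed to have one row per state and a "name" column), and the incident rows are made up for the example.

```python
import pandas as pd

# Toy incident table; only the state column matters here.
incidents = pd.DataFrame({"state": ["Ohio", "Ohio", "Colorado", "Ohio"]})

# Stand-in for the GeoJSON table, which has one row per state and a "name" column.
state_shapes = pd.DataFrame({"name": ["Ohio", "Colorado", "Utah"]})

# Count incidents per state and give the columns explicit names.
counts = incidents["state"].value_counts().reset_index()
counts.columns = ["name", "count"]

# A left merge keeps states with no incidents; fill their count with 0.
merged = state_shapes.merge(counts, on="name", how="left")
merged["count"] = merged["count"].fillna(0).astype(int)
```

This sketch uses a left merge so that a state with no recorded incidents still appears with a count of 0. A default (inner) merge would drop such a state instead.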

In [189]:
# Load the GeoJSON of US states from the folium examples as a GeoDataFrame (so we can add GeoJson tooltips)
# Source: https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json
import geopandas as gpd

state_geo = gpd.read_file("data/us-states.json")
In [190]:
# Summing up incidents per state
incident_count = data["state"].value_counts().reset_index()
incident_count.columns = ["name", "count"]
# Then merging since folium only does one data source for GeoJson
state_geo_count = state_geo.merge(incident_count, on = "name")

Now that we have a valid dataframe we need to create our maps.

In [191]:
from folium import Map, Choropleth
from folium.features import GeoJson, GeoJsonTooltip

total_shootings_by_state_map = Map(location = [43, -102], zoom_start = 4)

Choropleth(
    geo_data = state_geo,
    data = incident_count,
    bins = 9,
    columns = ["name", "count"],
    key_on = "feature.properties.name",
    legend_name = "Total shootings in state from 2013-2018",
    fill_color = "YlOrRd",
    fill_opacity = 0.7,
    line_opacity = 0.5,
    reset = True,
).add_to(total_shootings_by_state_map)
Out[191]:
<folium.features.Choropleth at 0x7ff0b1e49850>

The last cell made a map, then added a choropleth layer built from the state GeoJSON and the incident counts.

In [192]:
style = lambda x: {
    "fillColor": "#ffffff",
    "color": "#000000",
    "fillOpacity": 0.1,
    "weight": 0.1,
}

highlight = lambda x: {
    "fillColor": "#000000",
    "color": "#000000",
    "fillOpacity": 0.30,
    "weight": 0.1,
}

gjson = GeoJson(
    data = state_geo_count,
    style_function = style,
    highlight_function = highlight,
    control = False,
    tooltip = GeoJsonTooltip(
        fields = ["name", "count"],
        aliases = ["State", "Shootings"],
    ),
)
total_shootings_by_state_map.add_child(gjson)
total_shootings_by_state_map.keep_in_front(gjson)

We create three things in this cell. First, we define a style function and a highlight function, which specify the colors and opacity for their namesakes. Then we create the GeoJson object, which applies both functions and shows a tooltip with the relevant information when you hover over a state.

In [193]:
# Showing the map
total_shootings_by_state_map
Out[193]:
[Interactive map: total shootings by state, 2013-2018]

Next, we will make a time-based heatmap to show how shooting locations have changed over time. To preprocess, we make a copy of the dataframe, cut the dates down to yyyy-mm format, and then drop rows with missing values in the required fields (latitude and longitude).
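The preprocessing steps can be sketched on a toy frame as follows. The rows are illustrative only, and the sketch assumes dates are stored as "yyyy-mm-dd" strings, as in the sample output above.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; the second one has no latitude.
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-05", "2013-02-14"],
    "latitude": [40.3467, np.nan, 41.4455],
    "longitude": [-79.8559, -104.8020, -82.1377],
})

# Work on a copy so the original frame is left untouched.
monthly = toy.copy()

# Dates are "yyyy-mm-dd" strings, so the first 7 characters are "yyyy-mm".
monthly["date"] = monthly["date"].str[:7]

# Drop rows that cannot be placed on a map.
monthly = monthly.dropna(subset=["latitude", "longitude"])
```

Slicing the string is enough here because the month is only used as a grouping label. Converting to a datetime type would be an alternative if date arithmetic were needed.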

In [194]:
heatmap_df = data.copy()
# Keep only the "yyyy-mm" part of each date string
heatmap_df["date"] = heatmap_df["date"].str[:7]
heatmap_df = heatmap_df.dropna(subset = ["latitude", "longitude"])
heatmap_df.head()
Out[194]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01 Pennsylvania Mckeesport 0 4 NaN NaN Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01 California Hawthorne 1 3 NaN NaN Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01 Ohio Lorain 1 3 0::Unknown||1::Unknown 0::Unknown||1::Unknown Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01 Colorado Aurora 4 0 NaN NaN Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01 North Carolina Greensboro 2 2 0::Unknown||1::Unknown 0::Handgun||1::Handgun Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

Now we must make the time index and group all latitude/longitude pairs that occurred within each month.
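The grouping can be sketched on toy data: one list of [latitude, longitude] pairs per month, with the month labels collected in the same order. The variable names and coordinates below are illustrative.

```python
import pandas as pd

# Toy points, deliberately out of chronological order.
points = pd.DataFrame({
    "date": ["2013-02", "2013-01", "2013-01"],
    "latitude": [41.4455, 40.3467, 33.9090],
    "longitude": [-82.1377, -79.8559, -118.3330],
})

heat_frames = []
months = []
# groupby iterates over the groups in sorted key order by default.
for month, group in points.groupby("date"):
    months.append(month)
    heat_frames.append(group[["latitude", "longitude"]].values.tolist())
```

Collecting the labels inside the same loop guarantees that the i-th label describes the i-th list of points.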

In [195]:
heat_data = []
for _, df in heatmap_df.groupby("date"):
    heat_data.append([[row["latitude"], row["longitude"]] for _, row in df.iterrows()])
time_index = list(heatmap_df["date"].sort_values().unique())

Now we must make our actual map.

In [196]:
from folium.plugins import HeatMapWithTime

heatmap = Map(location = [43, -102], zoom_start = 4)

HeatMapWithTime(
    heat_data,
    index = time_index,
    radius = 10,
    auto_play = False,
    speed_step = 1,
    min_speed = 1,
).add_to(heatmap)
Out[196]:
<folium.plugins.heat_map_withtime.HeatMapWithTime at 0x7ff0b464d4f0>
In [197]:
heatmap
Out[197]:
[Interactive heatmap: shooting locations by month]